(57) ABSTRACT 

A fault of an operating system (hereafter referred to as "OS") 
is observed without adding dedicated hardware. Besides AP 
fault monitoring, check-point information is preserved at 
high speed without adding dedicated hardware. In a com- 
puter having a multi-OS environment, an AP fault monitor 
and a monitored AP operate on a first OS. A fault monitor 
operates on a second OS. Each of the first OS and the AP 
fault monitor transmits an alive message to the fault monitor 
by utilizing an inter-OS communication function of a multi- 
OS controller. The fault monitor monitors the alive message 
of each of the first OS and the AP fault monitor. The 
monitored AP transmits an alive message to the AP fault 
monitor. The AP fault monitor monitors it. The monitored 
AP preserves check-point information in a region of a 
memory. In that region of the memory, information is pre- 
served even across a fault and restart of the first OS. 

9 Claims, 6 Drawing Sheets 
METHOD FOR MONITORING FAULT OF 
OPERATING SYSTEM AND APPLICATION 
PROGRAM 



BACKGROUND OF THE INVENTION 

The present invention relates to a fault monitoring method 
for a computer, and in particular to a fault monitoring method 
of an operating system (OS) and an application program 
(AP). 

In general, an AP encounters a fault for several reasons 
and stops. Main causes are defects included in the AP itself 
and faults of an OS on which the AP is mounted. For an AP 
whose operation needs to be continued, the following 
method has been adopted: faults are monitored, and if a fault 
is detected, the operation of the AP is resumed from 
the state it was in before the occurrence of the fault, 
thereby attempting recovery of the AP from the fault. 

As one of fault monitoring methods for AP and OS, there 
is a method called watchdog. "Fault Tolerant Computer" 
written by Takashi NANYA, published by OHM-SHA says 
"The watchdog timer is a different process which is inde- 
pendent of a monitored process. The monitored process is so 
designed that a timer is reset at intervals of a fixed time (for 
example, in the range of several microseconds to several 
minutes) during the execution of the monitored process. If 
the timer is not reset until the time is up, some fault is 
considered to have occurred in the monitored process." 

In the case where a fault of an AP is monitored, an AP 
fault monitor having a watchdog receives a periodical alive 
message from the AP. If the alive message stops for a 
predetermined time, the AP fault monitor judges an AP fault 
to have occurred and restarts the AP. In the case where a fault 
of an OS is monitored, an OS fault monitor having a 
watchdog receives a periodical alive message from the OS. 
If the alive message disappears for a predetermined time, the 
OS fault monitor judges an OS fault to have occurred and 
restarts the OS. 
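
The watchdog behavior described above can be illustrated with a minimal sketch. The class name, method names, and the 3-second timeout are assumptions made for illustration and do not appear in the original text; times are passed in explicitly so the example is deterministic.

```python
class Watchdog:
    """Declares a fault when no alive message arrives within `timeout` seconds."""

    def __init__(self, timeout, now):
        self.timeout = timeout      # predetermined time (hypothetical value)
        self.last_alive = now       # time of the most recent alive message

    def alive(self, now):
        # Receiving an alive message resets the timer.
        self.last_alive = now

    def fault_detected(self, now):
        # True once the silence has lasted for the whole predetermined time.
        return now - self.last_alive >= self.timeout


wd = Watchdog(timeout=3.0, now=0.0)
wd.alive(now=2.0)
print(wd.fault_detected(now=4.0))  # only 2 s of silence
print(wd.fault_detected(now=5.5))  # 3.5 s of silence
```

In this sketch the monitor that owns the watchdog would restart the AP or the OS once `fault_detected` returns true.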

Furthermore, for recovery of an AP from a fault, a 
technique called check-point is used. In the check-point 
technique, execution information of the AP is acquired from 
the monitored AP periodically and preserved. When a fault 
has occurred, the preserved execution information is taken 
out, and processing of the AP is resumed from a check-point. 
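
A small sketch of the check-point technique follows. The job (summing integers), the dictionary used as the store, and the parameter names are hypothetical; they only illustrate saving execution state periodically and resuming from it after a fault.

```python
def run_job(total_steps, store, checkpoint_every=3, fail_at=None):
    """Sum the integers 1..total_steps, saving progress to `store` periodically."""
    # Resume from the last check-point if one exists, else start fresh.
    step, acc = store.get("checkpoint", (0, 0))
    while step < total_steps:
        step += 1
        if step == fail_at:
            raise RuntimeError("simulated fault")
        acc += step
        if step % checkpoint_every == 0:
            store["checkpoint"] = (step, acc)  # preserve execution state
    return acc


store = {}
try:
    run_job(10, store, fail_at=8)  # fault occurs after check-points at steps 3 and 6
except RuntimeError:
    pass
result = run_job(10, store)        # resumes from step 6 instead of step 0
print(result)
```

The shorter the check-point interval, the less work is repeated after a fault, which is why the speed of the check-point store matters.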

SUMMARY OF THE INVENTION 

As for an AP fault monitor using software, the AP fault 
monitor operates on the same OS as the monitored AP. In 
some cases, therefore, the AP fault monitor using software 
cannot cope with an AP fault caused by the OS. 
Furthermore, there is a known OS fault monitoring 
method in which a watchdog is formed of dedicated hardware 
and the watchdog monitors a periodical alive message 
supplied from an OS. However, there is a problem that 
dedicated hardware must be prepared. 

On the other hand, as for the check-point, if check-point 
information is preserved in a low speed memory, then it 
takes a considerable time to preserve the check-point 
information, and consequently the check-point repetition 
period becomes long. As a result, recovery from a fault must 
be conducted from a state of a considerably long time before 
the time of the fault occurrence. Furthermore, it is possible 
to provide a high speed non-volatile memory, such as a static 
RAM, apart from a volatile memory, such as a dynamic 
RAM, managed by the OS, and preserve check-point information 
in the high speed non-volatile memory. However, 
there is a problem that dedicated hardware must be prepared. 

An object of the present invention is to provide a method 
for monitoring a fault of the OS by using software, without 
adding dedicated hardware. 

Another object of the present invention is to provide a 
method for monitoring a fault of an AP and preserving 
check-point information of the AP at high speed without 
adding dedicated hardware. 

The present invention solves the above described prob- 
lems. In accordance with the present invention, in an oper- 
ating system fault monitoring method for a computer, the 
computer including a first OS, a second OS different from 
the first OS, a multi-OS controller for managing computer 
resources, the multi-OS controller having inter-OS commu- 
nication means between the first OS and the second OS, and 
a fault monitor operating on the second OS, the operating 
system fault monitoring method includes the steps of trans- 
mitting an alive message from the first OS to the fault 
monitor via the inter-OS communication means, and deter- 
mining whether the alive message has been received by the 
fault monitor within a predetermined time. 

In accordance with the present invention, in an applica- 
tion program fault monitoring method for a computer, the 
computer including an AP fault monitor operating on the 
first OS, and a high rank fault monitor operating on a second 
OS to monitor not only a fault of the first OS but also a fault 
of the AP fault monitor via inter-OS communication means, 
the application program fault monitoring method includes 
the step of monitoring a fault of the AP fault monitor 
operating on the first OS by using the high rank fault 
monitor. An AP to be monitored by the AP fault monitor 
preserves check-point information in a shared memory 
region on a main memory. In the shared memory region, 
information is preserved even across a fault and restart of 
the first OS. 

According to the present invention, fault monitoring of an 
OS to be monitored is conducted by utilizing a multi-OS 
environment and using a high rank fault monitor operating 
on another OS, as heretofore described. Without adding 
dedicated hardware, therefore, a fault of the OS can be 
monitored. Furthermore, a fault of an AP fault monitor can 
also be monitored by using the high rank fault monitor. 
Furthermore, since the high rank fault monitor using soft- 
ware is used, a recovery method to be used when a fault of 
each of the OS and AP fault monitor has occurred can be set 
finely. In addition, a monitored AP can preserve check-point 
information at high speed without adding dedicated hard- 
ware. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a diagram showing the configuration of a 
computer forming an embodiment of the present invention; 

FIG. 2 is a diagram showing the configuration of a storage 
region in a memory, in the embodiment shown in FIG. 1; 

FIG. 3 is a flow chart showing the procedure of start 
processing of each program, in the embodiment shown in 
FIG. 1; 

FIG. 4 is a flow chart showing the procedure of OS 
monitoring processing conducted by a high rank fault moni- 
tor 3 and recovery processing of a related program at the 
time of an OS fault, in the embodiment shown in FIG. 1; 
FIG. 5 is a flow chart showing the procedure of monitoring 
processing of an AP fault monitor conducted by a high 
rank fault monitor 3 and recovery processing of a related 
program at the time of a fault of the AP fault monitor, in the 
embodiment shown in FIG. 1; and 

FIG. 6 is a flow chart showing the procedure of monitoring 
processing of a monitored AP conducted by an AP fault 
monitor 4 and recovery processing of a related program of 
the monitored AP at the time of a fault, in the embodiment 
shown in FIG. 1. 

DESCRIPTION OF THE EMBODIMENT 

Hereafter, an embodiment of the present invention will be 
described. 

FIG. 1 is a diagram showing the configuration of a 
computer 100 in the present embodiment. The computer 100 
can be divided broadly into a processor 10 and a memory 
(main storage) 9. An input/output device and a storage 
connected to the processor 10 are not illustrated. The 
memory 9 stores programs of a first OS 1, a second OS 2, 
a high rank fault monitor (hereafter referred to as "high rank 
monitor") 3, an AP fault monitor (hereafter referred to as 
"AP monitor") 4, a monitored AP 5, and a multi-OS controller 
6. These programs are executed by the processor 10. 
Furthermore, the memory 9 includes a shared memory 
region 11 which can be accessed by both the first OS 1 and 
the second OS 2. 

The monitored AP 5 is an application mounted on the first 
OS 1. The monitored AP 5 may be a program for conducting 
general business application. The AP monitor 4 is an application 
mounted on the first OS 1, and it is a program for 
conducting fault monitoring of the monitored AP 5. The 
second OS 2 is an OS which is different from the first OS 1. 
The second OS 2 is desired to be an OS regarded as an OS 
having a higher reliability than the first OS 1. The high rank 
monitor 3 is an application mounted on the second OS 2, and 
it is a program for conducting fault monitoring of the first 
OS 1 and the AP monitor 4. 

The multi-OS controller 6 has the following functions. 

(a) Hardware Resource Division Function 
The multi-OS controller 6 manages computer resources 
such as CPU time, the memory, external devices, and timers, 
and divides these resources and provides them to respective 
OSs. For example, as for timers, the multi-OS controller 6 
has a function of virtually providing only one hardware 
timer to respective OSs. Furthermore, the multi-OS controller 
6 has a function of reserving an I/O resource of an 
external device utilized by respective OSs. 

(b) Inter-OS Message Passing Function 
This is a function for allowing processes on respective 
OSs to conduct message passing. Unlike the shared memory 
communication, the process of the other party of communication 
can be called. In the embodiment of the present 
invention, this message passing function is utilized as an 
alive message. 

(c) Inter-OS Shared Memory Communication Function 
The multi-OS controller 6 secures a memory region which 
can be shared between OSs as a physical memory region 
independent of physical memory regions for respective OSs. 
The shared memory region is divided into several blocks and 
managed. Each block is provided with its identifier (ID). By 
utilizing this shared memory identifier (ID) or a pointer to 
the physical memory, each OS can access the shared 
memory. Supposing that information is not recorded in the 
memory region, each OS conducts memory initialization 
after it has been started. For example, when an OS is 
restarted at the time of a fault of the OS, a memory region 
in which the state and so on of an application has been 
preserved is initialized. However, the shared memory region 
is not initialized because it is located outside the OSs' 
regions, and its contents are preserved. After restart, each OS 
can utilize a pseudo-non-volatile memory in a software 
manner by specifying the shared memory identifier (ID) 
used the last time or a pointer to a physical memory. In the 
embodiment of the present invention, the shared memory 
region is utilized as a check-point information storage region 
of the monitored AP. 

(d) OS Fault Notice and Fault Recovery Function 
The OS fault notice is a function of giving notice that a 
fault has occurred in a certain OS to the other OS. However, 
noticed faults are only faults that the OS itself is conscious 
of. Furthermore, there is a function of conducting recovery 
of an OS from a fault by restarting only the OS. 

The monitored AP 5 transmits an alive message to the AP 
monitor 4 periodically. The AP monitor 4 determines 
whether it has received an alive message from the monitored 
AP 5 within a predetermined fault detection time. Unless an 
alive message has been received, the AP monitor 4 judges 
the stoppage to have been caused by a fault of the monitored 
AP 5. In accordance with a preset method for recovery from 
a fault, the AP monitor 4 conducts recovery processing of the 
monitored AP 5 in which a fault has occurred. 

The first OS 1 and the AP monitor 4 periodically transmit 
alive messages to the high rank monitor 3 by utilizing the 
inter-OS message passing function of the multi-OS controller 
6 independently respectively. The high rank monitor 3 
determines whether it has received an alive message from 
each of the first OS 1 and the AP monitor 4 within a 
predetermined fault detection time. Unless an alive message 
has been received, the high rank monitor 3 judges the 
stoppage to have been caused by a fault of pertinent one of 
the first OS 1 and the AP monitor 4. In accordance with a 
preset method for recovery from a fault, the high rank monitor 
3 conducts recovery processing of the first OS 1 or the AP 
monitor 4 in which a fault has occurred. 

The monitored AP 5 periodically stores check-point information 
in the shared memory region 11 of the memory 9. 
When the monitored AP 5 is restarted due to a fault of the 
first OS 1, the AP monitor 4, or the monitored AP 5, the 
monitored AP 5 takes out check-point information from the 
shared memory region 11, and resumes processing of the 
monitored AP 5 from the check-point. 

In the present embodiment, only the high rank monitor 3 
is mounted on the second OS 2. The second OS 2 need only 
have a function for responding to a request from the high 
rank monitor 3. As compared with the first OS 1, the second 
OS 2 can be regarded as an OS which is small in overhead 
of CPU time and highly reliable. Typically, the memory 
occupation quantity of the second OS 2 is smaller than the 
memory occupation quantity of the first OS 1. 

It is possible to store the first OS 1, the second OS 2, the 
high rank monitor 3 and the multi-OS controller 6, or the 
first OS 1, the second OS 2, the high rank monitor 3, the AP 
monitor 4 and the multi-OS controller 6 into a storage 
medium, read them into the memory 9 via a drive device 
connected to the processor 10 or transmit them to the 
memory 9 of another computer via a drive device connected 
to the computer, the computer, and a network, and execute 
them by using the processor 10. 

FIG. 2 is a diagram showing the configuration of a storage 
region in the memory 9. However, a program region and the 
like are not illustrated. With regard to a region for storing 
control information and the like concerning the present 
invention, its data configuration is shown. 

High rank monitor control information 12 is information 
placed in a memory region of the second OS 2 and accessed 
by the high rank monitor 3. As control information 12, OS 
fault detection time 13 and an OS fault recovery method 14 
are stored with regard to a monitored subject ID 18, and AP 
monitor fault detection time 15 and an AP monitor fault 
recovery method 16 are stored with regard to a monitored 
subject ID 20. The monitored subject ID 18 is an identifier 
provided to the first OS 1. The OS fault detection time 13 is 
time allowed until an alive message is received from the first 
OS 1 when the high rank monitor 3 monitors a fault of the 
first OS 1. The OS fault recovery method 14 is information 
which indicates a recovery method at the time of a fault of 
the first OS 1. The OS fault recovery method 14 is divided 
into OS restart without memory dump, OS restart after 
memory dump acquisition, restart of only a specific function 
of an OS, and so on. The monitored subject ID 20 is an 
identifier provided to the AP monitor 4. AP monitor fault 
detection time 15 is time allowed until an alive message is 
received from the AP monitor 4 when the high rank monitor 
3 monitors a fault of the AP monitor 4. The AP monitor fault 
recovery method 16 is information which indicates a method 
to be used when the AP monitor 4 is recovered from a fault. 
The AP monitor fault recovery method 16 is divided into 
restart of the AP monitor 4 and the monitored AP 5, restart 
of the first OS 1, the AP monitor 4 and the monitored AP 5, 
and so on. 
Since the AP monitor 4 is considered to be higher in 
reliability than the first OS 1, a fault of the AP monitor 4 is 
caused in some cases by a fault of the first OS 1. In this case, 
a series of restart operations beginning with restart of the 
first OS 1 followed by restart of the AP monitor 4 is 
sometimes effective. 

First OS control information 17 is information placed in 
a memory region of the first OS 1 and accessed by the AP 
monitor 4. Monitored subject ID 18 is stored as the first OS 
control information 17. Control information required to 
restart the first OS 1 is stored. 

AP monitor control information 19 is information placed 
in the memory region of the first OS 1 and accessed by the 
AP monitor 4. Monitored subject ID 20, AP fault detection 
time 21, and an AP fault recovery method 22 are stored as 
the AP monitor control information 19. The AP fault detec- 
tion time 21 is time allowed until an alive message is 
received from the monitored AP 5 when the AP monitor 4 
monitors a fault of the monitored AP 5. The AP fault 
recovery method 22 is information which indicates a method 
to be used when the AP monitor 4 recovers the monitored AP 
5 from a fault. The AP fault recovery method 22 is divided 
into restart of only the monitored AP 5, restart of the AP 
monitor 4 and the monitored AP 5, restart of the first OS 1, 
the AP monitor 4 and the monitored AP 5, and so on. For the 
same reason as the case of the recovery of the AP monitor 
4, a series of restart operations beginning with restart of the 
first OS 1 followed by restart of the monitored AP 5 is 
sometimes effective. 

Monitored AP control information 23 is information 
placed in the shared memory region 11 and accessed by the 
monitored AP 5. A head block of a check-point region 25 is 
stored as check-point identifier ID 24. The check-point 
region 25 is a region for storing check-point information 
placed in the shared memory region 11 and accessed by the 
monitored AP 5. 
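
The following sketch illustrates why check-point information kept in the shared memory region survives a restart of the first OS. `SharedRegion`, `FirstOS`, and their methods are hypothetical stand-ins; a real multi-OS controller would manage physical memory rather than Python dictionaries.

```python
class SharedRegion:
    """Memory outside every OS region, split into blocks addressed by ID."""

    def __init__(self):
        self._blocks = {}
        self._next_id = 1

    def allocate(self):
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = {}
        return block_id

    def block(self, block_id):
        return self._blocks[block_id]


class FirstOS:
    def __init__(self, shared):
        self.shared = shared
        self.private = {}  # memory managed by the OS itself

    def restart(self):
        # OS-managed memory is initialized; the shared region is untouched.
        self.private.clear()


shared = SharedRegion()
first_os = FirstOS(shared)
checkpoint_id = shared.allocate()             # role of check-point identifier ID 24
shared.block(checkpoint_id)["progress"] = 42  # check-point information
first_os.private["scratch"] = "lost on restart"

first_os.restart()
restored = shared.block(checkpoint_id)["progress"]
print(restored, first_os.private)
```

After the simulated restart, the block is reached again through the same identifier, which is the software equivalent of the pseudo-non-volatile memory described in the text.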

FIG. 3 is a flow chart showing the procedure of start 
processing of respective programs conducted when the 
multi-OS computer 100 which is the embodiment of the 
present invention is started. After the processor 10 is started, 
the first OS 1, the second OS 2 and the multi-OS controller 
6 are started by execution of an automatic procedure (step 
31). Upon being started, the multi-OS controller 6 conducts 
resource division using assignment of computer resources, 
such as CPU time, memories, and external devices, utilized 
by the first OS 1 and the second OS 2, by referring to a 
configuration management file which is not illustrated. 
Subsequently, the second OS 2 starts the high rank monitor 
3 (step 32). Its start procedure is the same as an ordinary start 
procedure for starting an AP on the second OS 2. Its start is 
conducted manually, or automatically by executing a com- 
mand line on a file. 

Subsequently, the first OS 1 transmits OS fault detection 
time and an OS fault recovery method to the high rank 
monitor 3 via the multi-OS controller 6 and the second OS 
2 by utilizing the inter-OS message passing function of the 
multi-OS controller 6. The high rank monitor 3 receives 
such information and sets the information in the memory 9 
as the OS fault detection time 13 and the OS fault recovery 
method 14 (step 33). The high rank monitor 3 issues a 
monitored subject ID (for example, 01), and adds it to the OS 
fault detection time 13 and the OS fault recovery method 14 
on the memory. In addition, the high rank monitor 3 trans- 
mits the issued monitored subject ID to the first OS 1 via the 
second OS 2 and the multi-OS controller 6 by utilizing the 
inter-OS message passing function. The first OS 1 receives 
it and sets it in the memory 9 as the monitored subject ID 18 
of the first OS control information 17. Thereafter, the high 
rank monitor 3 starts monitoring of the first OS 1 by 
referring to the OS fault detection time 13 (step 34). In other 
words, the first OS 1 transmits an alive message to the high 
rank monitor 3 via the multi-OS controller 6 and the second 
OS 2 by using the inter-OS message passing function. The 
high rank monitor 3 receives this alive message, and deter- 
mines whether the alive message has been received within 
the OS fault detection time 13. 
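
The registration exchange of steps 33 and 34 can be sketched as below. The `HighRankMonitor` class, its method names, and the numeric settings are assumptions for illustration; the inter-OS message passing itself is reduced to a plain method call.

```python
import itertools


class HighRankMonitor:
    def __init__(self):
        self._ids = itertools.count(1)
        self.control_info = {}  # monitored subject ID -> registered settings

    def register(self, detection_time, recovery_method):
        """Store the settings sent by a subject and return its issued ID."""
        subject_id = f"{next(self._ids):02d}"
        self.control_info[subject_id] = {
            "detection_time": detection_time,
            "recovery_method": recovery_method,
        }
        return subject_id  # sent back to the subject in the real system

    def clear(self, subject_id):
        # After a restart the entry is cleared so it can be set again.
        self.control_info.pop(subject_id, None)


monitor = HighRankMonitor()
os_id = monitor.register(5.0, "OS restart without memory dump")
ap_monitor_id = monitor.register(2.0, "restart of AP monitor and monitored AP")
print(os_id, ap_monitor_id)
```

Monitoring of a subject would begin only after its entry exists, since the detection time is read from that entry.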

Subsequently, the first OS 1 starts the AP monitor 4 (step 
35). Its start procedure is the same as the ordinary start 
procedure for starting an AP on the first OS 1. The AP 
monitor 4 is started manually, or automatically 
by executing a command line on a file. Subsequently, the AP 
monitor 4 transmits AP monitor fault detection time and an 
AP monitor fault recovery method to the high rank monitor 
3 via the first OS 1, the multi-OS controller 6 and the second 
OS 2 by utilizing the inter-OS message passing function. 
The high rank monitor 3 receives such information and sets 
the information in the memory 9 as the AP monitor fault 
detection time 15 and an AP monitor fault recovery method 
16 of the monitor control information 12 (step 36). The high 
rank monitor 3 issues a monitored subject ID (for example, 
02), and adds it to the AP monitor fault detection time 15 and 
an AP monitor fault recovery method 16 on the memory. In 
addition, the high rank monitor 3 transmits the issued 
monitored subject ID to the AP monitor 4 via the second OS 
2, the multi-OS controller 6 and the first OS 1 by utilizing 
the inter-OS message passing function. The AP monitor 4 
receives it and sets it in the memory 9 as the monitored 
subject ID 20 of the AP monitor control information 19. 
Thereafter, the high rank monitor 3 starts monitoring of the 
AP monitor 4 by referring to the AP monitor fault detection time 
15 (step 37). In other words, the AP monitor 4 transmits an 
alive message to the high rank monitor 3 via the first OS 1, 
the multi-OS controller 6 and the second OS 2. The high 
rank monitor 3 receives this alive message, and determines 
whether the alive message has been received within the AP 
monitor fault detection time 15. 

Subsequently, the first OS 1 starts the monitored AP 5 
(step 38). The monitored AP 5 issues a request for the region 
of the check-point identifier ID 24 and the check-point 
region 25 to the multi-OS controller 6 via the first OS 1, and 
acquires them (step 39). Furthermore, the monitored AP 5 
sets the identifier ID of the check-point region 25 in the 
check-point identifier ID 24 of the monitored AP control 
information 23. Subsequently, the monitored AP 5 transmits 
AP monitor fault detection time and an AP monitor fault 
recovery method to the AP monitor 4 by utilizing the 
inter-process communication function of the first OS 1. The 
AP monitor 4 receives such information and sets the information 
in the memory 9 as the AP fault detection time 21 and 
an AP fault recovery method 22 of the monitor control 
information 19 (step 40). Thereafter, the AP monitor 4 starts 
monitoring of the monitored AP 5 by referring to the AP 
fault detection time 21 of the AP monitor control information 
19 (step 41). In other words, the monitored AP 5 
transmits an alive message to the AP monitor 4 by using the 
inter-process communication function. The AP monitor 4 
receives this alive message, and determines whether the 
alive message has been received within the AP fault detection 
time 21. 

The monitored AP 5 advances processing of the AP, while 
periodically storing check-point information in the check-point 
region 25 indicated by the check-point identifier ID 24. 

FIG. 4 is a flow chart showing the procedure of OS 
monitoring processing conducted by the high rank monitor 
3 and recovery processing of a related program at the time 
of an OS fault. The first OS 1 periodically executes a 
program for transmitting an alive message to the high rank 
monitor 3 by utilizing the inter-OS message passing 
function and thereby automatically transmits an alive message 
to the high rank monitor 3. The transmission period of 
the alive message is a time within the OS fault detection time 
13. The alive message includes the monitored subject ID 18 
and a one-bit signal (1 or 0) indicating an alive or dead state. 
By referring to the monitored subject ID 18 of the high rank 
monitor control information 12, the high rank monitor 3 
determines whether an alive message has come from the first 
OS 1 (step 51). When the high rank monitor 3 has received 
an alive message indicating an alive state (YES of step 51), 
a timer is reset and the processing returns to the step 51. If 
an alive message indicating an alive state is not received 
(NO of step 51) and the OS fault detection time 13 has not 
elapsed (NO of step 52), then the processing returns to the 
step 51. 

If the OS fault detection time 13 has elapsed (YES of step 
52), then the high rank monitor 3 judges the first OS 1 to 
have stopped due to a fault, and conducts processing for 
recovery from the OS fault in accordance with the OS fault 
recovery method 14 (step 53). The high rank monitor 3 
requests the multi-OS controller 6 to conduct the recovery 
processing of the first OS 1 from a fault. The multi-OS 
controller 6 conducts the recovery processing of the first OS 
1 from a fault. If fault information such as a memory dump 
of the first OS 1 is recorded in a disk or the like, it can be 
used to pursue the cause of a fault of the first OS 1 later. In 
this case, however, restart of the first OS 1 is delayed. There 
is a trade-off between the requested restart time of the first OS 
1 and the merit of recording. If the recovery processing is 
finished, then restart of the first OS 1 and resetting of control 
information are conducted (step 54). Even if a fault occurs 
in the first OS 1 and the first OS is restarted, the high rank 
monitor control information 12 located in the memory 
region of the second OS 2 is preserved. Since the high rank 
monitor 3 needs to detect the timing of restart termination of 
the first OS 1 and the OS fault recovery method 14 needs to 
be changeable, however, the high rank monitor 3 clears the 
OS fault detection time 13, the OS fault recovery method 14, 
and the monitored subject ID 18, and resets the OS fault 
detection time 13, the OS fault recovery method 14, and the 
monitored subject ID 18 in accordance with the above 
described procedure. The high rank monitor 3 transmits the 
monitored subject ID 18 to the first OS 1, and then restarts 
the fault monitoring of the first OS 1. 

In the case where the first OS 1 is restarted due to the OS 
fault detection, the AP monitor 4 is also forcibly stopped. 
Therefore, restart of the AP monitor 4 and resetting of the 
control information are conducted (step 55). In the same 
way, the high rank monitor 3 clears the AP monitor fault 
detection time 15, the AP monitor fault recovery method 16, 
and the monitored subject ID 20, and resets the AP monitor 
fault detection time 15, the AP monitor fault recovery 
method 16 and the monitored subject ID 20 in accordance 
with the above described procedure. The high rank monitor 
3 transmits the monitored subject ID 20 to the AP monitor 
4, and then restarts the fault monitoring of the AP monitor 
4. 

Subsequently, the high rank monitor 3 conducts restart of 
the monitored AP 5 and resetting of the control information 
(step 56). The monitored AP 5 recognizes its restart, and 
resets the AP fault detection time 21 and the AP fault 
recovery method 22 in accordance with the above described 
procedure. Subsequently, the monitored AP 5 acquires 
check-point information from the check-point region 25 
indicated by the check-point identifier ID 24 (step 57). The 
AP monitor 4 resets the AP fault detection time 21 and the 
AP fault recovery method 22, and then restarts the fault 
monitoring of the monitored AP 5 (step 58). In the same way 
as the foregoing description, the monitored AP 5 periodically 
stores check-point information in the check-point 
region 25. 

FIG. 5 is a flow chart showing the procedure of AP 
monitor monitoring processing conducted by the high rank 
monitor 3 and recovery processing of a related program at 
the time of an AP monitor fault. The AP monitor 4 calls a 
function for transmitting an alive message to the high rank 
monitor 3, and periodically transmits an alive message to the 
high rank monitor 3. The transmission period of the alive 
message is a time within the AP monitor fault detection time 
15. The alive message includes the monitored subject ID 20 
and a one-bit signal (1 or 0) indicating an alive or dead state. 
By referring to the monitored subject ID 20 of the high rank 
monitor control information 12, the high rank monitor 3 
determines whether an alive message has come from the AP 
monitor 4 (step 61). When the high rank monitor 3 has 
received an alive message indicating an alive state (YES of 
step 61), a timer is reset and the processing returns to the 
step 61. If an alive message indicating an alive state is not 
received (NO of step 61) and the AP monitor fault detection 
time 15 has not elapsed (NO of step 62), then the processing 
returns to the step 61. 

If the AP monitor fault detection time 15 has elapsed 
(YES of step 62), then the high rank monitor 3 judges the AP 
monitor 4 to have stopped due to a fault, and conducts 
processing for recovery of the AP monitor 4 from the fault 
in accordance with the AP monitor fault recovery method 16. 
Here, the AP monitor fault recovery method 16 conducts 
restart beginning from the AP monitor 4 and resetting of 
control information (step 63). The high rank monitor 3 clears 
the AP monitor fault detection time 15, the AP monitor fault 
recovery method 16, and the monitored subject ID 20, and 
resets the AP monitor fault detection time 15, the AP monitor 
fault recovery method 16, and the monitored subject ID 20 
in accordance with the above described procedure. The high 
rank monitor 3 transmits the monitored subject ID 20 to the 
AP monitor 4, and then restarts the fault monitoring of the 
AP monitor 4. 
It is necessary to do start setting of the monitored AP 5
over again, with the restart of the AP monitor 4. Therefore,
restarting of the monitored AP 5 and resetting of control
information are conducted (step 64). The monitored AP 5
recognizes its restart, and resets the AP fault detection time
21 and the AP fault recovery method 22 in accordance with
the above described procedure. Subsequently, the monitored
AP 5 acquires check-point information from the check-point
region 25 indicated by the check-point identifier ID 24 (step
65). The AP monitor 4 resets the AP fault detection time 21
and the AP fault recovery method 22, and then restarts the
fault monitoring of the monitored AP 5 (step 66). In the same
way as the foregoing description, the monitored AP 5
periodically stores check-point information in the
check-point region 25.
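The store-and-acquire behavior of the check-point region can be sketched as follows. This is a minimal illustration in Python and not the implementation of the embodiment: the names CheckpointRegion, store, acquire, and restart_monitored_ap are hypothetical, and a plain dictionary keyed by the check-point identifier stands in for memory that is preserved across a restart.

```python
import pickle


class CheckpointRegion:
    """Hypothetical stand-in for check-point region 25. In the description the
    region is preserved across restarts; here a dict keyed by the check-point
    identifier plays that role."""

    def __init__(self):
        self._regions = {}

    def store(self, identifier, state):
        # Serialize so later changes to the live object cannot alter the saved copy.
        self._regions[identifier] = pickle.dumps(state)

    def acquire(self, identifier):
        # Return the last saved state, or None if nothing has been stored yet.
        blob = self._regions.get(identifier)
        return None if blob is None else pickle.loads(blob)


def restart_monitored_ap(region, identifier, initial_state):
    """Sketch of step 65: a restarted AP resumes from its last check-point
    if one exists, otherwise from its initial state (an assumption)."""
    saved = region.acquire(identifier)
    return initial_state if saved is None else saved
```

Falling back to an initial state when no check-point exists is a choice made for this sketch; the description does not say what the monitored AP 5 does in that case.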

If restart beginning from the first OS 1 is set in the AP 
monitor fault recovery method 16, the high rank monitor 3 
conducts the processing of the step 53, and thereafter the 
processing of the steps 54 to 58 is conducted. 

FIG. 6 is a flow chart showing the procedure of monitored
AP fault monitoring processing conducted by the AP monitor
4 and recovery processing of a relating program at the
time of a monitored AP fault. The monitored AP 5 calls a
function for transmitting an alive message to the AP monitor
4, and periodically transmits an alive message to the AP
monitor 4. The transmission period of the alive message is
a time within the AP fault detection time 21. The alive
message includes a 1-bit signal indicating an alive or dead
state. The AP monitor 4 determines whether an alive message
has come from the monitored AP 5 (step 71). When the
AP monitor 4 has received an alive message indicating an
alive state (YES of step 71), a timer is reset and the
processing returns to the step 71. If an alive message
indicating the alive state is not received (NO of step 71) and
the AP fault detection time 21 has not elapsed (NO of step
72), then the processing returns to the step 71.
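The loop of steps 71 and 72 can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the class name AliveWatchdog, its methods, and the injectable clock are assumptions made for the example. Only the 1-bit alive or dead flag and the AP fault detection time 21 come from the description.

```python
import time


class AliveWatchdog:
    """Hypothetical sketch of steps 71-72: the timer is reset on each alive
    message, and a fault is declared once the detection time has elapsed."""

    def __init__(self, detection_time, clock=time.monotonic):
        self.detection_time = detection_time  # AP fault detection time 21, in seconds
        self.clock = clock                    # injectable so the logic can be tested
        self.last_alive = clock()

    def on_message(self, alive):
        """Step 71: only a message whose bit indicates the alive state resets the timer."""
        if alive:
            self.last_alive = self.clock()

    def fault_detected(self):
        """Step 72: True once the detection time has elapsed without an alive message."""
        return self.clock() - self.last_alive >= self.detection_time
```

A monitor built this way would call fault_detected() on each pass of its loop and start recovery when it returns True. A message carrying the dead state deliberately leaves the timer untouched, which matches the NO branch of step 71.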

If the AP fault detection time 21 has elapsed (YES of step
72), then the AP monitor 4 judges the monitored AP 5 to
have stopped due to a fault, and conducts processing for
recovery of the monitored AP 5 from the fault in accordance
with the AP fault recovery method 22. Here, assuming that
the AP fault recovery method 22 sets restart of only the
monitored AP 5, restart of the monitored AP 5 and resetting
of control information are conducted (step 73). The monitored
AP 5 recognizes its restart, and resets the AP fault
detection time 21 and the AP fault recovery method 22 in
accordance with the above described procedure.
Subsequently, the monitored AP 5 acquires check-point
information from the check-point region 25 indicated by the
check-point identifier ID 24 (step 74). The AP monitor 4
resets the AP fault detection time 21 and the AP fault
recovery method 22, and then restarts the fault monitoring of
the monitored AP 5 (step 75). In the same way as the
foregoing description, the monitored AP 5 periodically
stores check-point information in the check-point region 25.

If restart beginning from the AP monitor 4 is set in the AP
fault recovery method 22 and restart beginning from the AP
monitor 4 is set in the AP monitor fault recovery method 16,
then the AP monitor 4 transmits an alive message indicating
death of the AP monitor 4 to the high rank monitor 3, and
thereby processing of steps 63 to 66 is conducted. If a series
of restart operations beginning from restart of the first OS 1
and followed by restart of the AP monitor 4 is set in the AP
fault recovery method 22 and a series of restart operations
beginning from restart of the first OS 1 and followed by
restart of the AP monitor 4 is set in the AP monitor fault
recovery method 16, then the high rank monitor 3 conducts
processing of the step 53, and thereafter processing of steps
54 to 58 is conducted.

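The way the recovery setting selects a processing path can be sketched as a simple dispatch. This is an illustration in Python under stated assumptions: the enum RecoveryMethod and the function recovery_steps are hypothetical names, the description does not say how the AP fault recovery method 22 is encoded, and the sketch assumes the AP monitor fault recovery method 16 is set consistently with it.

```python
from enum import Enum


class RecoveryMethod(Enum):
    """Hypothetical encoding of the setting held in the AP fault recovery method 22."""
    RESTART_AP_ONLY = 1
    RESTART_FROM_AP_MONITOR = 2
    RESTART_FROM_FIRST_OS = 3


def recovery_steps(method):
    """Map a recovery setting to the step numbers the description associates with it."""
    if method is RecoveryMethod.RESTART_AP_ONLY:
        return [73, 74, 75]               # restart of only the monitored AP 5
    if method is RecoveryMethod.RESTART_FROM_AP_MONITOR:
        return [63, 64, 65, 66]           # restart beginning from the AP monitor 4
    if method is RecoveryMethod.RESTART_FROM_FIRST_OS:
        return [53, 54, 55, 56, 57, 58]   # restart beginning from the first OS 1
    raise ValueError(f"unknown recovery method: {method!r}")
```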
What is claimed is: 

1. A fault monitoring method in a computer including a 
first OS, a second OS, and a multi-OS controller having a 
communication function between said first OS and said 
second OS, said fault monitoring method comprising the 
steps of: 

monitoring, in an AP fault monitor activated by execution 
of a program on said first OS, a fault of an application 
(hereafter referred to as "AP") activated by execution 
of a program on said first OS; 

successively receiving, in a high rank fault monitor acti- 
vated by execution of a program on said second OS, a 
monitor alive message indicating an alive state of said 
AP fault monitor transmitted from said AP fault 
monitor, by using an inter-OS communication function 
of said multi-OS controller; 

successively receiving, in said high rank fault monitor, an 
OS alive message indicating an alive state of said first 
OS transmitted from said first OS, by using an inter-OS 
communication function of said multi-OS controller; 
and 

in said high rank fault monitor, restarting said AP fault 
monitor and recovering said AP when a reception 
interval of said AP alive message has exceeded a first 
predetermined time, and restarting said first OS and 
said AP fault monitor and recovering said AP when a 
reception interval of said OS alive message has 
exceeded a second predetermined time. 

2. A fault monitoring method according to claim 1, 
wherein said high rank fault monitor determines whether 
said AP alive message and said OS alive message have 
exceeded said first and second predetermined time, 
respectively, by comparing the reception intervals of said AP 
alive message and said OS alive message with predeter- 
mined time information pieces respectively stored in a 
memory included in said computer. 

3. A fault monitoring method according to claim 2, 
wherein when restarting said first OS said high rank fault 
monitor restarts said first OS in accordance with a restart 
method stored in said memory, when restarting said AP fault 
monitor said high rank fault monitor restarts said AP fault 
monitor in accordance with a restart method stored in said 
memory, and when recovering said AP said high rank fault 
monitor recovers said AP based on check-point information 
stored in a shared memory region in said computer. 

4. A fault monitoring method according to claim 1, 
wherein when said AP fault monitor has sensed a fault of 
said AP, said AP is recovered based on check-point infor- 
mation stored in a shared memory region in said computer. 

5. A multi-OS computer having a fault monitor function, 
comprising: 

a processor; 
a memory; 

a first OS loaded on said memory by said processor; 
a second OS loaded on said memory by said processor; 
a multi-OS controller having a communication function
between said first OS and said second OS, loaded on
said memory by said processor;

an AP activated by execution of a program on said first
OS;

an AP fault monitor for monitoring a fault of said AP, said 
AP fault monitor being activated by execution of a 
program on said first OS; and 
a high rank fault monitor for monitoring a fault of said AP
fault monitor, said high rank fault monitor being activated
by execution of a program on said second OS,

wherein by using the inter-OS communication function of
said multi-OS controller, said high rank fault monitor
successively receives a monitor alive message indicat- 
ing an alive state of said AP fault monitor transmitted 
from said AP fault monitor, 

wherein by using the inter-OS communication function of 
said multi-OS controller, said high rank fault monitor 
successively receives an OS alive message indicating 
an alive state of said first OS transmitted from said first 
OS, and 

wherein said high rank fault monitor restarts said AP fault
monitor and recovers said AP when a reception interval
of said AP alive message has exceeded a first predetermined
time, and said high rank fault monitor restarts
said first OS and said AP fault monitor and recovers
said AP when a reception interval of said OS alive
message has exceeded a second predetermined time.

6. A multi-OS computer according to claim 5, wherein 
said high rank fault monitor determines whether said AP 
alive message and said OS alive message have exceeded 
said first and second predetermined time, respectively, by
comparing the reception intervals of said AP alive message 
and said OS alive message with predetermined time infor- 
mation pieces respectively stored in said memory. 

7. A multi-OS computer according to claim 6, further 
comprising a shared memory region in said memory, said
shared memory region being capable of being accessed from 
said first OS and second OS, 

wherein when restarting said first OS said high rank fault 
monitor restarts said first OS in accordance with a 
restart method stored in said memory, when restarting 
said AP fault monitor said high rank fault monitor
restarts said AP fault monitor in accordance with a
restart method stored in said memory, and when recov- 
ering said AP said high rank fault monitor recovers said 
AP based on check-point information stored in said 
shared memory region. 

8. A storage medium for storing a computer readable 
program, said program comprising: 

a first OS; 

a second OS different from said first OS; 

a multi-OS control program unit for managing computer 
resources, said multi-OS control program unit having 
an inter-OS communication function between said first 
OS and said second OS; 

a fault monitoring program unit for monitoring a fault of 
said first OS via said inter-OS communication function, 
said fault monitoring program unit operating on said 
second OS; 

an AP fault monitoring program unit for monitoring a 
fault of an AP operating on said first OS, said AP fault 
monitoring program unit operating on said first OS; and 

a fault monitoring program unit for monitoring a fault of 
said AP fault monitoring program unit besides a fault of 
said first OS via said inter-OS communication function, 
said fault monitoring program unit operating on said 
second OS. 

9. A storage medium for storing a computer readable 
program according to claim 8, wherein said multi-OS con- 
trol program unit provides a storage region, check-point 
information of said AP is stored in said storage region in 
response to a request from said AP, and information is 
preserved in said storage region by a fault and restart of said 
first OS as well. 